The UK Data Service

Who we are

  • Five partner universities

    • UK Data Archive, University of Essex (lead partner)
    • Cathie Marsh Institute, University of Manchester
    • Jisc, University of Manchester
    • EDINA, University of Edinburgh
    • University College London
  • 90+ staff

  • Since 2012 (UKDA \(\rightarrow\) 1967); curates national data since 2003

What we do

  • The main single point of access for UK social science data
  • Secondary data collection, curation and access
  • Training and user support
  • Communication and user engagement
  • Impact
  • … Key part of the UK social science research infrastructure, funded by the UKRI/ESRC

Our data…

  • UK social survey microdata:

    • Cross-sectional: large government and academic surveys
    • Longitudinal: major studies following people over time
  • International data: survey data, aggregate databases

  • Census tables and individual data – current and historical

  • Business microdata and administrative data

  • Qualitative data: multimedia files and interview transcripts

Our training in practice

  1. Webinars and online workshops
  2. User Conferences: four main user conferences each year
  3. Drop-in sessions: Survey, Computational Social Science and SecureLab
  4. Online learning materials: find key resources on our Learning Hub
  5. Helpdesk for individual data queries
  6. Check out our YouTube channel

1. Why a webinar series on data linkage?

The world is changing

  • Once upon a time: mostly census and surveys (lots!)
  • Emerging / new kinds of data in the last 20 years
    1. Administrative data
    2. Digital trace data (e.g. social media, web…)
    3. Smart data (e.g. flow, device-generated)
    • 2 and 3: genuinely new data
    • 1: Digitalisation of records \(\longrightarrow\) greater availability
  • Increased demand for personal insights: the Monitored Self
  • Potential for new research avenues at a lower cost…

Growing role of non-survey data

  • Why interest is growing
    • New and previously unavailable measurements
    • Large scale, high frequency, low marginal cost
    • Attractiveness of ‘harder’ type of data…
    • … Particularly in social epidemiology and socio-economic research
  • But:
    • Collected for administrative purposes, not research
    • Selective coverage \(\longrightarrow\) population exclusions
    • Measurement error, changing definitions, policy artefacts
    • Limited socio-demographic and subjective information

The survey landscape

  • Data collection challenges…
    • Rising costs under tighter budgets
    • Recruitment is harder (reach, refusals, panel fatigue)
    • Alleviation comes at a cost: larger samples, incentives
    • Growth of online and mixed-mode designs
  • … But they remain essential:
    • Only source for attitudes, beliefs, motivations, well-being
    • Rich socio-demographic information not captured elsewhere: detailed occupation, social class, ethnicity
    • Theory-driven, validated instruments (e.g. GHQ)
    • Only tested tool for population-representative data, hard-to-reach groups and subgroup analysis

In a nutshell

  • Budgetary pressures on survey data
  • Wealth of cheaper, but narrowly focused, often unrepresentative new forms of data
  • Data integration is a win-win: potential to improve both kinds of data through validation and enhancement (Benzeval 2020)
  • At the same time:
    • Increased complexity of the data provision landscape, especially for non-experts
    • Linkage is still a limited (but growing) practice, and few linked datasets are available for secondary research
    • Need to adapt the skills training/capacity building

2. Exploring integrated data

Working definition

  • Combining different sources of data, i.e.:

    • Survey data \(\leftrightsquigarrow\) survey data
    • Survey data \(\leftrightsquigarrow\) non-survey data
    • Non-survey data \(\leftrightsquigarrow\) non-survey data
  • That includes a shared unit of observation (individual, household, area…)

  • … In a coherent way in order to:

    • Validate or
    • … enhance the original data
  • Bidirectional

  • In this presentation: linkage = integration

Validation example

  • How reliable are population survey-based estimates of chronic diseases?
  • Data linkage to validate prevalence of selected chronic conditions:
    • Angina, myocardial infarction, heart failure, and asthma
  • Link 11,323 adults from the 2013 and 2014 Welsh Health Survey to clinical data
  • Secure Anonymised Information Linkage (SAIL) Databank
  • Results: quality depends on condition:
    • Less agreement for cardiovascular, better for asthma
    • Potentially cheaper
    • But not devoid of technical difficulties
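A validation exercise of this kind boils down to comparing, person by person, a self-reported flag from the survey with a flag derived from the linked clinical record. The sketch below is a minimal illustration of that comparison, not the method used in the Welsh Health Survey study: the function name `agreement` and the ten 0/1 flags are hypothetical, and it computes only raw agreement, Cohen's kappa and the sensitivity of the survey relative to the clinical record.

```python
def agreement(survey, clinical):
    """Compare two parallel lists of 0/1 flags for the same linked people.

    survey[i]   = 1 if person i self-reports the condition
    clinical[i] = 1 if person i's linked clinical record shows it
    """
    n = len(survey)
    pairs = list(zip(survey, clinical))
    both = sum(1 for s, c in pairs if s == 1 and c == 1)
    neither = sum(1 for s, c in pairs if s == 0 and c == 0)

    observed = (both + neither) / n
    p_survey = sum(survey) / n        # prevalence according to the survey
    p_clinical = sum(clinical) / n    # prevalence according to the records
    # Agreement expected by chance if the two sources were independent
    expected = p_survey * p_clinical + (1 - p_survey) * (1 - p_clinical)

    kappa = None if expected == 1 else (observed - expected) / (1 - expected)
    # Share of clinically recorded cases that the survey also picks up
    sensitivity = None if sum(clinical) == 0 else both / sum(clinical)

    return {
        "survey_prevalence": p_survey,
        "clinical_prevalence": p_clinical,
        "observed_agreement": observed,
        "kappa": kappa,
        "sensitivity": sensitivity,
    }


# Hypothetical flags for ten linked respondents (illustration only)
survey_flags = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
clinical_flags = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
result = agreement(survey_flags, clinical_flags)
print(result)
```

Two sources can show the same prevalence and still disagree about who has the condition, which is why individual-level linkage is needed rather than a comparison of published totals.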

Kinds of data linked to survey data

Administrative data

  • Usually arising from the interaction between:

    • A public organisation or body…
    • … the unit for which records are produced (e.g. people)
  • Examples:

    • Registry data: birth, death, marriage records,
    • Health records, educational transcripts
    • Government records: benefits, earnings/income
    • Financial reports, e.g. credit ratings, mortgage applications
  • In the UK: enabled by the Digital Economy Act 2017:

    … de-identified data from government service providers, excluding NHS data, as part of their day-to-day functions, may be shared for public good research

Hospital episodes data

  • NHS data about all hospital admissions in England.
  • Four datasets:
    • Episodes of using: Accident and Emergency, Admitted Patient Care, Adult Critical Care, Outpatients
    • Mostly available for 2007/9-2023
  • Data on diagnosis, maternity, mortality, mental health, length of treatment, deprivation etc.
  • Available for the NCDS Birth Cohort

Linking survey with digital trace data

  • Increasing usage of linked survey and social media data
  • Typical example: asking survey respondents to have their SM behaviour (digital trace) tracked
  • May reduce the cost of the survey (fewer questions to ask)
  • Subject to consent: representativeness issues
  • Consent depends on the app, gender, etc.
  • See the DIGISURVOR project for an example of current research
    • Linkage of existing survey data with online participation in political discussions

Smart data

  • Fuzzier definition
    • Digital records held by private sector organisations
    • Often but not always device-recorded data
    • Not traditionally associated with social research
    • Flow/ quasi real time data
  • Examples:
    • Data from fitness trackers, smart watches, and in-car smart tech \(\longrightarrow\) health and mobility research
    • Financial transactions by businesses and individuals, loyalty cards, purchase records (from banks, supermarkets) \(\longrightarrow\) financial behaviour and resilience
    • Energy network data (distribution & consumption); EV usage and charging; smart meter readings \(\longrightarrow\) building/household energy consumption \(\longrightarrow\) net zero targets

3. Which survey data are most commonly linked?

… In practice

  • Major longitudinal studies:

    • Birth cohort studies
    • Next Steps and ELSA
    • Understanding Society
  • A few large scale cross-sectional surveys such as:

    • ASHE (Annual Survey of Hours and Earnings)
    • Family Resources Survey
    • Scottish Health Survey (project)

Birth cohort studies

  • Follow a sample of individuals from birth onwards
  • Four so far: people born in 1958, 1970, 2000, 2026
  • Millennium Cohort Study (MCS)
    • ~ 19,000 children (born between September 2000 and January 2002)
    • 8 ‘sweeps’: at 9 months, then at ages 3, 5, 7, 11, 14, 17, 23
    • Parent and child interviews
    • Focuses on education, skills and health, truancy, cognitive ability, biological measurements
    • … Traditional socio-economic and demographic data

Other cohort studies

  • Next Steps

    • AKA Longitudinal Study of Young People in England
    • 16,000 people in England born 1989-90, from secondary school age (i.e. 13-14) onwards
    • Set up by DfE to study determinants of school outcomes
  • ELSA (English Longitudinal Study of Ageing):

    • Follows a sample of 19,000 people aged over 50 to understand all aspects of ageing in England.
    • Started in 2002, biennial waves.
    • Data on physical and mental health (incl. well-being), financial circumstances, and attitudes about ageing.

Annual Survey of Hours and Earnings (ASHE)

  • Produced yearly by the Office for National Statistics
  • Sample drawn from NI records: typical n = 135,000-190,000
  • Small number of variables
  • Very precise source of information for pay components and working hours
  • Can be linked to other business surveys, as well as PAYE and pensions data
  • Some data available via Administrative Data Research UK

A sample of the integrated datasets curated by UKDS

Next Steps: Student Loans Data

  • Data on higher education loans for Next Steps participants

    • who provided consent to linkage in the age 25 sweep.
  • Information about:

    • Full Next Steps dataset +

      • applications for student finance,
      • payment transactions & repayment details (via respondents’ accounts),
      • Overseas assessment.
  • Also hospitalisation episodes data (SN8681)

MCS: National Pupil Database

  • Data for children in England whose carer gave consent

  • Linked to National Pupil Database and the Pupil Level Annual School Census.

    • Pupil level school census data from N1 to year 11 (2016/17)
    • KS1, KS2, KS4 and KS5 results (Years 2, 6, 11, 12 and 13)
    • Absence data from year 1 to year 11
    • School characteristics and school changes: N1 to year 11
    • Anonymised School identifiers (URN) and anonymised Local Education Authorities (LEA)
  • Also available for Next Steps and Understanding Society

  • Also linked with Ofsted Reports data

Vacancy Survey 2005-2025

  • Statutory, monthly survey of ~6,000 GB businesses
  • Single question:
    • The number of job vacancies for which the business is actively seeking recruits from outside its organisation
  • Sample drawn from the Inter-Departmental Business Register (IDBR; compiled from HMRC VAT and PAYE registers)
  • Available via linkage: SIC code (industrial activities classification), number of employees
  • Additional linkage via IDBR possible - including ASHE

4. Data integration and skills requirements

Survey analysis skills

  • Traditional deterministic matching (i.e. merging)
    • The simplest case: individual-level data matched to individual-level data via an unambiguous identifier
    • The same holds at the aggregate level (for example, matching smart sensor data at small-area level)
  • Probabilistic matching
    • When the sources use separate IDs
    • When data is not clean
  • Statistical inference & non random samples

Emerging skills

  • Computational skills
    • Non-survey cleaning (Pandas, Tidyverse)
    • Web scraping (Python/R)
    • API queries (social media app: X, Reddit…)
    • Pattern detection, e.g. random forests
  • Regulatory knowledge
    • Data protection & GDPR - prerequisite \(\longrightarrow\) UKDS Safe researcher training
    • Departmental regulations in case of Governmental data
    • Institutional/procedural, i.e. how to engage with the data matching intermediaries
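As a small illustration of non-survey data cleaning, the sketch below standardises free-text UK postcodes so that records from different sources can be matched on the same key. It is a minimal example, not a validator: the function name `normalise_postcode` is hypothetical, and the only rule it relies on is that the inward part of a UK postcode is its last three characters.

```python
import re


def normalise_postcode(raw):
    """Return a postcode as 'OUTWARD INWARD' in upper case, or None if implausible.

    Only the overall length is checked (5 to 7 characters without spaces);
    this does not confirm that the postcode actually exists.
    """
    compact = re.sub(r"\s+", "", raw).upper()
    if not compact.isalnum() or not 5 <= len(compact) <= 7:
        return None
    return compact[:-3] + " " + compact[-3:]


# Hypothetical raw values as they might appear in an administrative extract
raw_values = [" co4 3sq ", "M139PL", "eh8  9yl", "n/a"]
cleaned = [normalise_postcode(value) for value in raw_values]
print(cleaned)
```

Values that cannot be cleaned are returned as `None` rather than guessed, so that they can be counted and reported as unmatched.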

5. Who’s who in the data integration landscape

Administrative Data Research (ADR)

  • Consortium of organisations, including the ONS, devolved governments and academic partners
  • Mission:
    • “link and open up de-identified data
    • generated from people’s interactions with public services,
    • making it securely available to accredited researchers.”
  • Point of access for new data linkage within the public sector and between the public sector and researchers

UK Longitudinal Linkage Collaboration (UK LLC)

  • Trusted Research Environment - TRE

  • Currently enables linkage between longitudinal studies and data from:

    • NHS England
    • Neighbourhood geographies
    • Address geographies
  • Planned: NHS Wales, Department for Work and Pensions, HM Revenue and Customs

Data producers

  • Data producers of the main longitudinal studies, e.g.:
    • Understanding Society (ISER)
    • Main cohort studies (CLS)
  • Involved in data matching (e.g. consent management)
  • CLOSER: coordination & cross-study harmonisation
  • Government departments and the ONS
  • Private sector organisations

Other intermediaries

  • Smart Data Research UK
    • Gathers smart data
    • Makes it available to the research community
    • Organised by kind of smart data
  • Secure Anonymised Information Linkage (SAIL)
    • Wales-based, but holds data from across the UK, (mostly) health-related
    • Trusted Research Environment - TRE

6. Data linkage at UKDS: roles, routes, and researcher options

The UK Data Service and data linkage: core principles

  • UKDS does not create linkages or integrate data
  • Linked data are created by data owners or processors
  • UKDS negotiates access to these data collections and makes them research-ready and safely accessible
  • The type of linkage researchers can undertake depends on:
    • the access level (Open / Safeguarded / Controlled)
    • the presence or absence of identifiers

What researchers can do in UKDS SecureLab

  • Researchers can:
    • Access more granular variables
    • Create derived or contextual linkages, for example:
      • Environmental or pollution deciles based on postcode-derived measures
      • Area-level deprivation or service access indicators
    • Import external datasets subject to depositor approval
  • Key considerations:
    • UKDS SecureLab does not host direct identifiers
    • Researchers cannot create linkage spines or perform identifier-based matching
    • All linkage activity must be explicitly approved as part of the project
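A derived or contextual linkage of the kind listed above does not rely on personal identifiers: an area-level measure is turned into deciles and attached to respondents through a shared area code. The sketch below illustrates the idea under stated assumptions. The area codes, scores, and the names `assign_deciles` and `attach_context` are hypothetical, and nothing here reflects the actual SecureLab tooling or approval process.

```python
def assign_deciles(area_scores):
    """Rank areas by score and split them into ten equal-sized groups.

    Returns {area_code: decile}, where 1 is the lowest-scoring tenth
    and 10 the highest. Works best when there are at least ten areas.
    """
    ranked = sorted(area_scores, key=area_scores.get)
    n = len(ranked)
    return {area: (rank * 10) // n + 1 for rank, area in enumerate(ranked)}


def attach_context(respondents, deciles, area_key="area_code"):
    """Add the area decile to each respondent; None if the area is unknown."""
    return [{**r, "pollution_decile": deciles.get(r[area_key])} for r in respondents]


# Hypothetical area-level pollution scores for twenty areas
scores = {f"A{i:02d}": float(i) for i in range(20)}
deciles = assign_deciles(scores)

# Hypothetical respondents carrying only an area code, no direct identifiers
respondents = [
    {"pid": 1, "area_code": "A00"},
    {"pid": 2, "area_code": "A19"},
    {"pid": 3, "area_code": "Z99"},  # area missing from the external file
]
linked = attach_context(respondents, deciles)
print(linked)
```

Attaching a decile rather than the raw area score or the area code itself also limits how precisely a respondent's location can be inferred from the output.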

7. How to access linked data at UKDS

Conclusion so far

  • Potential for lots of exciting new research, some of which is already happening
  • Dynamic landscape: changes in the role of actors may take place
  • Need for investigating capacity building/skills training
  • Please follow us for additional webinars on digital trace, administrative and smart data
  • Kind of data left out for now: Census (longitudinal & administrative linkage)

References

Millennium Cohort Study: Linked Education Administrative Datasets (National Pupil Database - KS1-KS5), England, 2003-2021: Secure Access

Next Steps: Linked Administrative Datasets (Student Loans Company Records), 2007 - 2021: Secure Access

Vacancy Survey, 2005-2025: Secure Access

Grant, P. (2024) The Monitored Self. In: The Virtual Hospital. Springer, Cham.

Kerry-Barnard, S., Mohamad Zaki, N.H., Gomes, D., Ploubidis, G., Sanchez-Galvez, A. (2025) National Child Development Study: A guide to the linked health administrative datasets – Hospital Episode Statistics (HES). User Guide (Version 3). London: UCL Centre for Longitudinal Studies.

Peters, A., Sanchez-Galvez, A., Fitzsimons, E., Gomes, D. (2025) Millennium Cohort Study: Linked education administrative datasets – Ofsted. User Guide (Version 1). London: UCL Centre for Longitudinal Studies.

Silber, H., Breuer, J., Beuthner, C., Gummer, T., Keusch, F., Siegers, P., … Weiß, B. (2022) Linking Surveys and Digital Trace Data: Insights From two Studies on Determinants of Data Sharing Behaviour. Journal of the Royal Statistical Society, Series A (Statistics in Society), 185(Suppl. 2), 387-407.

Whiffen, T., Akbari, A., Paget, T., Lowe, S., Lyons, R. (2020) How effective are population health surveys for estimating prevalence of chronic conditions compared to anonymised clinical data? International Journal of Population Data Science (IJPDS), 5(1).